The two core analysis aims are:
To do: add in text from Heather’s outline.
To finish.
Initially, the data were matched using an extraction from the Australian Honours List, which was then matched to Wikidata and Wikipedia (see section below - TO DO - ADD LINK).
Once the initial analysis was done, Alex Lum assisted by providing an extraction of Wikipedia pages where an order was stated on the page. This led to the Order information being merged back into the Wikidata entry, which allowed us to extract further Wikipedia pages. No additional pages have been created in Wikipedia, but we have been able to get a better
The Department of the Prime Minister and Cabinet publishes a list of Australian Honours recipients. This list includes all recipients of the Order of Australia.
Records were extracted from this database for all Order of Australia awards issued since 1975, based on the following award levels:
More information about the Order of Australia can be found here: https://en.wikipedia.org/wiki/Order_of_Australia.
While the majority of cases are unique, some individuals have been awarded multiple Orders of Australia. In the analysis below, every figure that references the Honours data set represents the number and type of awards issued, not the number of people. The number of awards in our data set is XXXX (insert here). These awards have been given to XXX individuals. A summary of the number of awards given to individuals is as follows:
(show summary of number of awards by number of people)
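The summary described above could be produced along the following lines. This is a minimal sketch only: the original analysis was done in R, Python is used here purely for illustration, and the records and person identifiers are hypothetical. It assumes each recipient can be given a reliable key, which names alone may not provide.

```python
from collections import Counter

# Hypothetical records: (person identifier, award abbreviation).
# These are illustrative values, not rows from the honours data set.
awards = [
    ("person_a", "AM"),
    ("person_a", "AO"),
    ("person_b", "OAM"),
    ("person_c", "AM"),
    ("person_c", "AO"),
    ("person_c", "AC"),
]

# Step 1: count the awards held by each person.
awards_per_person = Counter(person for person, _ in awards)

# Step 2: count how many people hold 1, 2, 3, ... awards.
people_by_award_count = Counter(awards_per_person.values())

for n_awards in sorted(people_by_award_count):
    print(n_awards, "award(s):", people_by_award_count[n_awards], "people")
```

The total of the first tally gives the number of awards, and its length gives the number of individuals, which are the two figures still to be inserted above.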
Expand on process here.
How many have been awarded?
What is the breakdown by state?
What proportion of Order of Australia recipients have a Wikipedia page?
What are the differences by the order level?
Are there any differences by recipient state?
How many had pages BEFORE or AFTER they received their Order of Australia? Is this different by order level?
Does receiving an order result in a spike of wikipedia pages being created?
What is the rate of creation of pages? Have there been peaks? Has it slowed at any time?
These are small things I find along the way that may not be that important, but are interesting or that I need to follow up on.
To do: insert table showing tally of records for each stage
Of the 41303 records extracted from the honours database, 2828 matched a Wikidata entry with a connection to Australia.
Once these were filtered for matches to a Wikipedia page, 2474 articles remained.
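The tally for each stage can be laid out as a simple funnel. The sketch below uses only the three counts reported above; the stage labels are my own wording, and Python is used for illustration rather than the R used in the analysis.

```python
# Counts taken from the text; labels are illustrative wording.
stages = [
    ("Honours records extracted", 41303),
    ("Matched to a Wikidata entry with an Australian connection", 2828),
    ("Matched to a Wikipedia article", 2474),
]

total = stages[0][1]
rows = []
previous = None
for label, count in stages:
    share_of_total = count / total
    # The first stage has no earlier stage to compare against.
    share_of_previous = count / previous if previous else 1.0
    rows.append((label, count, share_of_total, share_of_previous))
    previous = count

for label, count, share_of_total, share_of_previous in rows:
    print(f"{label}: {count} "
          f"({share_of_total:.1%} of all records, "
          f"{share_of_previous:.1%} of previous stage)")
```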
The honours data set was downloaded from the Australian Honours Search Facility published by the Department of the Prime Minister and Cabinet.
Records were extracted from this database for all Order of Australia awards issued since 1975, based on the following award levels:
More information about the Order of Australia can be found here: https://en.wikipedia.org/wiki/Order_of_Australia
The data set contains a total of 41303 honours, with each row of the data set representing an individual award recipient.
The data variables in our set are:
## [1] "AwardId" "AwardedOn" "AwardName"
## [4] "AwardAbbr" "AwardSystem" "ClaspLevel"
## [7] "ClaspText" "GazetteName" "GazetteGivenName"
## [10] "GazetteSurname" "GazetteSuburb" "GazetteState"
## [13] "GazettePostcode" "AnnouncementEvent" "Division"
## [16] "AdditionalInfo" "Citation"
The data extracted is shown for the first few rows of the data set as below:
Names of Order of Australia recipients were then passed through Wikidata to gather more information, such as a description of the person and their aliases.
This was done so that recipients, who are normally gazetted under their full name, could also be matched on any other names they are known by. The query was run using an R package (WikidataR) that accesses the Wikidata API and matches not only the name the award is given to, but also any alias listed in a Wikidata entry. For example, Bob Hawke is listed as:
Bob Hawke
For his AC, he was named as “Mr Robert James Lee HAWKE”. The API searches against his name and aliases to give us a match.
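The idea of matching on aliases can be illustrated with a small sketch. This is not the WikidataR query itself: it is a simplified stand-in in Python that assumes exact matching after normalisation, whereas the API search may be more forgiving. The function names, the title list and the alias list are hypothetical.

```python
import re

# Honorifics stripped before comparison (an illustrative, incomplete list).
TITLES = {"mr", "mrs", "ms", "miss", "dr", "prof", "professor", "sir", "dame"}

def normalise(name):
    """Lower-case a name, drop punctuation and remove common titles."""
    tokens = re.findall(r"[a-z']+", name.lower())
    return [t for t in tokens if t not in TITLES]

def matches_entry(gazette_name, label, aliases):
    """Return True if the gazetted name equals the label or any alias."""
    gazette = normalise(gazette_name)
    return any(normalise(candidate) == gazette for candidate in [label, *aliases])

# Illustrative entry: the alias list is assumed, not copied from Wikidata.
label = "Bob Hawke"
aliases = ["Robert James Lee Hawke", "Robert Hawke"]

print(matches_entry("Mr Robert James Lee HAWKE", label, aliases))
print(matches_entry("Mr Robert James Lee HAWKE", label, []))
```

Without the alias list, the gazetted name does not equal the label "Bob Hawke", which is why the aliases matter for the match.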
This match of Order of Australia recipients with a wikiData match returned the following information:
The description provided in the Wikidata record often included key words such as “Australian”. Using these as a starting point, this data set was filtered to include any mention of “Australian”, as well as other key words or phrases such as “Queensland”, “Tasmania”, “New South Wales” etc. Likewise, records with other terms such as “United States”, “Dutch”, “Spanish” etc. were excluded from the list in the absence of an Australian-related term.
Once this filtering and matching was done based on the description field, there remained a list of “unallocated” records that I sifted through manually and allocated to the Australian list if there was a match. This was done using other information contained in the Wikidata entry or in the Honours information.
A final edit was undertaken to remove cases that referred to “non-person” items such as parks, ovals, reserves, articles, discographies, filmographies, foundations etc. that may have included the name of the award recipient.
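The three filtering steps above can be sketched as a single classification function. This is a Python illustration under stated assumptions, not the R code used: the term lists contain only the examples named in the text plus a few singular forms, the category labels are hypothetical, and the non-person check is applied first here for simplicity even though it was the final edit in the actual process.

```python
import re

# Illustrative term lists; the real lists were longer ("etc." in the text).
AUSTRALIAN_TERMS = ["australian", "australia", "queensland", "tasmania", "new south wales"]
FOREIGN_TERMS = ["united states", "dutch", "spanish"]
NON_PERSON_TERMS = ["park", "oval", "reserve", "article",
                    "discography", "filmography", "foundation"]

def has_term(text, terms):
    """True if any term appears in the text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(term) + r"\b", text) for term in terms)

def classify(description):
    """Assign a Wikidata description to one of four hypothetical categories."""
    text = description.lower()
    if has_term(text, NON_PERSON_TERMS):
        return "non-person"
    if has_term(text, AUSTRALIAN_TERMS):
        return "australian"      # kept, even if a foreign term is also present
    if has_term(text, FOREIGN_TERMS):
        return "excluded"        # foreign term with no Australian qualifier
    return "unallocated"         # left for manual review

for text in ["Australian politician", "Spanish-born Australian chef",
             "Dutch painter", "footballer", "park in Queensland"]:
    print(text, "->", classify(text))
```

Whole-word matching is used so that, for example, a description containing "Parker" is not treated as a park; whether the original filter made that distinction is not stated.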
The information is displayed below showing the head of the data set.
Using the list of Australian matches, the wikidata ID was used in a scraper to get the wikipedia page url and article ID from wikidata. This gave us 2474 wikipedia article links.
Each wikipedia page match has also been linked to a page creation date.
The above query sorts all revisions from oldest to most recent and pulls the top (earliest) timestamp for page ID 2352403. Using a loop, this query was run for the Wikipedia page ID of each matched article, and the timestamp was recorded.
(This query was found via a search on Stack Overflow, and I built a simple scraper to store the timestamp against the page ID.)
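For reference, a query of this kind can be built against the MediaWiki API as sketched below. This is a reconstruction under assumptions, not the query actually used: it relies on the standard `prop=revisions` parameters (`rvdir=newer`, `rvlimit=1`, `rvprop=timestamp`), it is written in Python rather than R, and the sample response and its timestamp are made up so that the example runs without network access.

```python
import json
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"

def creation_query_url(page_id):
    """Build a URL asking for the earliest revision timestamp of a page."""
    params = {
        "action": "query",
        "prop": "revisions",
        "pageids": page_id,
        "rvlimit": 1,          # only one revision is needed
        "rvdir": "newer",      # list revisions oldest first
        "rvprop": "timestamp",
        "format": "json",
    }
    return API_ENDPOINT + "?" + urlencode(params)

def first_timestamp(response_text, page_id):
    """Pull the timestamp of the first listed revision from a JSON response."""
    data = json.loads(response_text)
    page = data["query"]["pages"][str(page_id)]
    return page["revisions"][0]["timestamp"]

# Made-up response in the shape the API returns; the timestamp is a placeholder.
sample_response = json.dumps({
    "query": {"pages": {"2352403": {
        "pageid": 2352403,
        "revisions": [{"timestamp": "2005-01-01T00:00:00Z"}],
    }}}
})

print(creation_query_url(2352403))
print(first_timestamp(sample_response, 2352403))
```

Looping over the matched page IDs and storing each returned timestamp against its ID would reproduce the scraper described above.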
All data sets were then merged together, into a final data set. Cases were merged using the following method:
The variables included on the full data set are:
## [1] "AwardId" "awardDate" "AwardName"
## [4] "AwardAbbr" "GazetteName" "suburb"
## [7] "state" "GazettePostcode" "AnnouncementEvent"
## [10] "Citation" "Gender" "personDescription"
## [13] "refurl" "date_awarded" "wpURL"
## [16] "wikipediaPageID" "name" "pageCreation"
## [19] "wdData" "wpPage" "award"
## [22] "dateDiff" "daysDiff" "prePost"
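The last three variables appear to be derived from the award date and the page creation date. Their exact definitions are not given here, so the sketch below is an assumption: it takes `daysDiff` to be the page creation date minus the award date in days, and `prePost` to flag whether the page was created before ("pre") or on or after ("post") the award. The function name and labels are hypothetical, and Python is used for illustration.

```python
from datetime import date

def page_timing(award_date, page_creation):
    """Return (days between award and page creation, 'pre' or 'post').

    Assumed convention: a negative difference means the page existed
    before the award was made; a same-day page counts as 'post'.
    """
    days_diff = (page_creation - award_date).days
    pre_post = "pre" if days_diff < 0 else "post"
    return days_diff, pre_post

# Illustrative dates only, not taken from the data set.
print(page_timing(date(2000, 1, 26), date(2005, 1, 26)))
print(page_timing(date(2010, 6, 14), date(2010, 6, 4)))
```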
There are pros and cons to this method. It speeds up the manual process of checking whether the matched records are Order of Australia award recipients. It also means that a record may inadvertently have been included for someone who is not an Order of Australia recipient, but who matched on name and on a text identifier (such as “Australian”, “Queensland” etc).
For example, suppose one Bob Smith has received an AM but has no Wikidata entry, while a second Bob Smith has no AM but has a Wikidata entry with a description that says “Australian medical researcher”. The second Bob Smith will be included in the list that is matched to the Wikipedia query, even though he has no award. If he also has a Wikipedia page, he will be included in the final data set.
If an award recipient’s description included another country but did not mention “Australia” or other Australian-related terms, the recipient will not be included in the list. For example, Jane A Smith has a Wikidata entry and has received an OA. Her description says “Italian-born artist”. She would be excluded from our list based on the presence of “Italian” without an Australian qualifier. If Jane Smith has an OA and is described as an “Italian-born Australian artist”, she is included on our list.
It is hypothesised that these examples are the exception rather than the rule, and that the majority of matching cases identified in the process are correct. A manual check of approximately 500 cases found five match errors, an error rate of about 1%.
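The result of the manual check can be worked through to give a rough sense of the uncertainty around the error rate. The sketch below is an addition for illustration, not part of the original analysis: it assumes the checked cases were a random sample of exactly 500 and uses a Wilson score interval at roughly 95% confidence (z = 1.96), in Python.

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a proportion (default is roughly 95%)."""
    p = errors / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

errors, checked = 5, 500      # figures from the manual check, taking n as 500
rate = errors / checked
low, high = wilson_interval(errors, checked)
print(f"observed error rate: {rate:.1%}")
print(f"approximate 95% interval: {low:.2%} to {high:.2%}")
```

If the checked cases were not chosen at random, the interval would not apply and the observed rate should be read as indicative only.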